10/9/2018

Introduction:

Data exploration in SQL and R on the Babynames data set.

(Also, an exploration in ioslides_presentation in RMarkdown) This is still a work in progress, and due to frame limitations, I can't show everything.

Prelude to SQL: making a big file

Before loading all of the data into MariaDB, I had to merge it all together and make a column for year. I accomplished this in the following cell. Source: Stack Overflow

path <- 
  '/Users/brennandonnell/grad_school/data900/sql/names'
file_names <- 
  dir(path, 
      pattern = "yob" ) 
df <- do.call(
  rbind, 
  lapply(file_names, function(x) cbind(fread(
           x, sep=',', 
           col.names = c('babyname','m/f','frequency'),
           colClasses = list(character=1:2, numeric=3)), 
           year=strsplit(strsplit(x,'yob')[[1]][[2]], '.txt')[[1]])))
df$length <- str_length(unlist(df[[1]]))
write.csv(df, paste0(path,'/fullnames.csv'), sep=',')

Loading mariadb in

Loading Mariadb into mySQL and R. This uses several libraries:

  • DBI
  • dplyr
  • dbplyr
  • RMariaDB
  • odbc
con <- dbConnect(
  drv = RMariaDB::MariaDB(), 
  dbname = 'Babynames',
  username = uname,
  password = pword
)

Creating the schema

this is just the framework for how to create a table in SQL. How I actually did it was by importing the full CSV into Sequel Pro through file -> import

CREATE TABLE demonoo (
  baby_name CHAR(15),
  gender CHAR(1),
  frequency INT(11),
  year INT
)

Query 1

For Query 1, we searched for the most popular male name and female name for that person’s year of birth. The following cell is a small sample of it.

rs <- dbSendQuery(con, 
"SELECT babyname,gender, MAX(frequency), year 
FROM newnames 
GROUP BY gender, year 
ORDER BY year;")
head(dbFetch(rs))
##   babyname gender MAX(frequency) year
## 1     Mary      F           7065 1880
## 2     John      M           9655 1880
## 3     Mary      F           6919 1881
## 4     John      M           8769 1881
## 5     Mary      F           8148 1882
## 6     John      M           9557 1882
dbClearResult(rs)
#dbDisconnect(con)

Query 1: the SQL

It is possible in R Markdown to create chunks of SQL code. This is the same code as the previous slide, just done directly in SQL.

SELECT babyname,gender, MAX(frequency), year 
FROM newnames 
GROUP BY gender, year 
ORDER BY year;
Most popular name by year
babyname gender MAX(frequency) year
Mary F 7065 1880
John M 9655 1880
Mary F 6919 1881

Query 2

For Query 2, we searched for the popularity of a given person’s name for each year from 1879 to 2017.

Naturally, I picked my own perfect name, to see how many people have the same name as mine.

rs <- dbSendQuery(con, 
"SELECT babyname, gender, frequency, year 
FROM newnames WHERE babyname = 'brennan' AND gender = 'm' 
ORDER BY year;")
tail(dbFetch(rs), n=3)
##    babyname gender frequency year
## 67  Brennan      M       687 2015
## 68  Brennan      M       548 2016
## 69  Brennan      M       458 2017
dbClearResult(rs)

Query 2: SQL

Again, R Notebooks and R Markdown have the ability to run SQL code. This is the same code as the cell above.

SELECT babyname, gender, frequency, year FROM newnames 
WHERE babyname = 'brennan' AND gender = 'm' 
ORDER BY year;
Popularity of Brennan by year
babyname gender frequency year
Brennan M 5 1916
Brennan M 6 1949
Brennan M 6 1950

Plotting:

For the following trick, we shall examine some visualizations. But first, we have to group the data.

by_brennan <- df %>%
  group_by(year) %>% 
  filter(babyname == "Brennan") 

Plotting:

More plotting

look at name lengths

by_length <- df %>% 
  filter(length >= 10) %>% 
  group_by(year) %>% 
  summarise(n = n())

more plotting